Categorical Attribute traNsformation Environment (CANE): A python module for categorical to numeric data preprocessing
Authors
Abstract
Categorical Attribute traNsformation Environment (CANE) is a simple but powerful Python package for preprocessing categorical data. The package is valuable because a large range of Machine Learning (ML) algorithms can currently only be trained using numerical data (e.g., Deep Learning, Support Vector Machines), while several real-world ML applications are associated with categorical data attributes. Currently, CANE offers three categorical to numeric transformation methods, namely: Percentage Categorical Pruned (PCP), Inverse Document Frequency (IDF) and a simple standard One-Hot-Encoding method. Additionally, the module is well documented, with several code examples to help in its adoption by non-expert users.
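To make the three transformation families concrete, the following is a minimal standard-library sketch of what each one might look like. It is not CANE's API: the function names `one_hot`, `idf` and `pcp`, the `perc` and `other` parameters, and the exact pruning rule are assumptions chosen for illustration. Here IDF is assumed to map each level to ln(n / frequency of the level), and PCP is assumed to keep the most frequent levels until they cover a fraction `perc` of the rows and to merge the remaining levels into a single "Others" level.

```python
import math
from collections import Counter


def one_hot(values):
    """Return (rows, levels): one binary column per distinct level."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return rows, levels


def idf(values):
    """Map each value to ln(n / count of its level) (assumed IDF form)."""
    n = len(values)
    counts = Counter(values)
    return [math.log(n / counts[v]) for v in values]


def pcp(values, perc=0.9, other="Others"):
    """Keep the most frequent levels until they cover `perc` of the rows;
    merge every remaining level into `other` (assumed pruning rule)."""
    n = len(values)
    counts = Counter(values)
    kept, covered = set(), 0
    for level, count in counts.most_common():
        if covered / n >= perc:
            break
        kept.add(level)
        covered += count
    return [v if v in kept else other for v in values]


# Hypothetical toy attribute: 6 rows of "a", 3 of "b", 1 of "c".
data = ["a"] * 6 + ["b"] * 3 + ["c"]
pruned = pcp(data, perc=0.8)      # "c" is merged into "Others"
encoded, levels = one_hot(pruned)  # one column per remaining level
weights = idf(data)                # rarer levels get larger values
```

Pruning before one-hot encoding, as in the usage lines above, keeps the number of generated columns small when an attribute has many rare levels; IDF instead produces a single numeric column regardless of the number of levels.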
Similar resources
Clustering Mixed Numeric and Categorical Data: A Cluster Ensemble Approach
Clustering is a widely used technique in data mining applications for discovering patterns in underlying data. Most traditional clustering algorithms are limited to handling datasets that contain either numeric or categorical attributes. However, datasets with mixed types of attributes are common in real life data mining applications. In this paper, we propose a novel divide-and-conquer techniq...
A Divisive Ordering Algorithm for Mapping Categorical Data to Numeric Data
The amount of computing time for K Nearest Neighbor Search is linear to the size of the dataset if the dataset is not indexed. This is not endurable for on-line applications with time constraints when the dataset is large. However, if there are categorical attributes in the dataset, an index cannot be built on the dataset. One possible solution to index such datasets is to convert categorical a...
Clustering Large Data Sets with Mixed Numeric and Categorical Values
Efficient partitioning of large data sets into homogenous clusters is a fundamental problem in data mining. The standard hierarchical clustering methods provide no solution for this problem due to their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets. However, their use is often limited to numeric data. In this paper we pres...
Cluster Center Initialization for Categorical Data Using Multiple Attribute Clustering
The K-modes clustering algorithm is well known for its efficiency in clustering large categorical datasets. The K-modes algorithm requires random selection of initial cluster centers (modes) as seed, which leads to the problem that the clustering results are often dependent on the choice of initial cluster centers and non-repeatable cluster structures may be obtained. In this paper, we propose ...
Systematic Search for Categorical Attribute-value Data-driven Machine Learning
Optimal Pruning for Unordered Search is a search algorithm that enables complete search through the space of possible disjuncts at the inner level of a covering algorithm. This algorithm takes as inputs an evaluation function, e, a training set, t, and a set of specialisation operators, o. It outputs a set of operators from o that creates a classifier that maximises e with respect to t. While O...
Journal
Journal title: Software Impacts
Year: 2022
ISSN: 2665-9638
DOI: https://doi.org/10.1016/j.simpa.2022.100359